9 research outputs found

    Characterizing VNTRs in human populations

    Over half the human genome consists of repetitive sequences. One major class is the tandem repeats (TRs), which are defined by their location in the genome, repeat unit, and copy number. TR loci that exhibit variant copy numbers are called Variable Number Tandem Repeats (VNTRs). High VNTR mutation rates of approximately 0.0001 per generation make them suitable for forensic studies, and of interest for potential roles in gene regulation and disease. TRs are generally divided into three classes by pattern length: 1) microsatellites or short tandem repeats (STRs), 2) minisatellites, and 3) macrosatellites with patterns >100 bp. To date, mini- and macrosatellites have been poorly characterized, mainly due to a lack of computational tools. In this thesis, I used an existing tool, VNTRseek, to identify human minisatellite VNTRs using short-read sequencing data from nearly 2,800 individuals, and developed a new computational tool, MaSUD, to identify human macrosatellite VNTRs using data from 2,504 individuals. MaSUD is the first high-throughput tool to genotype macrosatellites using short reads. I identified over 35,000 minisatellite VNTRs and over 4,000 macrosatellite VNTRs, most previously unknown. A small subset in each VNTR class was validated experimentally and in silico. The detected VNTRs were further studied for their effects on gene expression, ability to distinguish human populations, and functional enrichment. Unlike STRs, mini- and macrosatellite VNTRs are enriched in regions with functional importance, e.g., introns, promoters, and transcription factor binding sites. A study of VNTRs across 26 populations shows that minisatellite VNTR genotypes can be used to predict super-populations with >90% accuracy. In addition, genotypes for 195 minisatellite VNTRs and 22 macrosatellite VNTRs were shown to be associated with differential expression in nearby genes (eQTLs). Finally, I developed a computational tool, mlZ, to infer undetected VNTR alleles and to detect false positive predictions. mlZ is applicable to other tools that use read support for predicting short variants. Overall, these studies provide the most comprehensive analysis of mini- and macrosatellites in human populations and will facilitate the application of VNTRs for clinical purposes.
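    The abstract defines a VNTR as a TR locus whose copy number varies between individuals. A minimal Python sketch of that definition follows. It is an illustration only, not the method used by VNTRseek, MaSUD, or mlZ; the function names and the toy copy-number lists are hypothetical.

```python
from collections import Counter


def is_vntr(copy_numbers):
    """Return True if more than one distinct copy number is observed at a locus.

    `copy_numbers` is a list of per-allele copy numbers across individuals
    (hypothetical input format, chosen only for this illustration).
    """
    return len(set(copy_numbers)) > 1


def allele_frequencies(copy_numbers):
    """Return the fraction of observed alleles carrying each copy number."""
    counts = Counter(copy_numbers)
    total = sum(counts.values())
    return {allele: n / total for allele, n in sorted(counts.items())}


# Toy data: locus_A is invariant, locus_B shows three different copy numbers.
loci = {
    "locus_A": [3, 3, 3, 3],
    "locus_B": [5, 6, 5, 7],
}
vntr_loci = [name for name, alleles in loci.items() if is_vntr(alleles)]
```

    In practice a genotyping tool would also have to account for read support and sequencing error before calling an allele, which this sketch ignores.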

    Algorithms for the discovery of large genomic inversions using pooled clone sequencing

    An inversion is a chromosomal rearrangement in which an internal segment of a chromosome has been broken twice, flipped 180 degrees, and rejoined. Most known examples of large inversions were found indirectly from studies on human disease where inversions have no detectable effect in parents, but increase the risk of a disease-associated rearrangement in the offspring. The development of a map of inversion polymorphisms will provide valuable information regarding their distribution and frequency in the human genome and will help unravel how inversions and the segmental duplication architecture associated with inverted haplotypes contribute to genomic susceptibility to disease rearrangements. The 1000 Genomes Project spearheaded the development of several methods to identify inversions; however, these methods are limited to relatively short inversions, and there are currently no available algorithms to discover large inversions using high throughput sequencing (HTS) technologies. This is mainly because the breakpoints of such events typically lie within segmental duplications and common repeats, reducing the mappability of short reads. We propose using pooled clone sequencing (PCS), a method originally developed to improve haplotype phasing, to characterize large genomic inversions. PCS merges the advantages of clone-based sequencing approaches with the speed and cost efficiency of HTS technologies. Using this sequencing data, we developed a novel algorithm, dipSeq, for discovering large inversions (>500 Kbp), following the observation that clones that span the inversion breakpoint will be split into two sections (split clones) when mapped to the reference genome. We evaluate the performance of dipSeq on 3 sets of simulated data, demonstrating its correctness and robustness to structural duplications and other types of structural variations. We further applied dipSeq to the genome of a HapMap individual (NA12878). dipSeq was able to accurately discover all previously known and experimentally validated large inversions. We also identified a new inversion and confirmed it using fluorescent in situ hybridization. Although dipSeq displays a relatively high false positive rate using real data, it performed better with simulated data, suggesting that the performance with the NA12878 genome may be improved with higher depth of coverage. Rasekh, Marzieh Eslami, M.S.
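    The split-clone observation can be illustrated with a short Python sketch. This is not the dipSeq implementation. It assumes, hypothetically, that the mapped segments of one pool on one chromosome are given as (start, end) intervals, that clones are roughly 150 kb long, and that two segments lying at least 500 kb apart whose combined length matches one clone are candidate halves of a split clone. The function name and all thresholds except the 500 Kbp minimum are assumptions.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on the reference, same chromosome


def find_split_clone_pairs(
    segments: List[Interval],
    min_clone: int,
    max_clone: int,
    min_gap: int,
) -> List[Tuple[Interval, Interval]]:
    """Pair distant segments whose combined length fits a single clone.

    A pair is reported when the two segments are at least `min_gap` apart
    and their lengths sum to a value within [min_clone, max_clone].
    """
    segs = sorted(segments)
    pairs = []
    for i, (s1, e1) in enumerate(segs):
        for s2, e2 in segs[i + 1:]:
            gap = s2 - e1
            total = (e1 - s1) + (e2 - s2)
            if gap >= min_gap and min_clone <= total <= max_clone:
                pairs.append(((s1, e1), (s2, e2)))
    return pairs


# Toy input: two short segments that together look like one ~150 kb clone,
# plus one intact clone-sized segment elsewhere on the chromosome.
segments = [
    (1_000_000, 1_060_000),
    (1_900_000, 1_990_000),
    (5_000_000, 5_150_000),
]
candidates = find_split_clone_pairs(
    segments, min_clone=140_000, max_clone=160_000, min_gap=500_000
)
```

    A real caller would still need to combine evidence from several split clones at both breakpoints and filter pairs explained by other structural variants, which this sketch does not attempt.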

    Discovery of large genomic inversions using long range information.

    Background: Although many algorithms are now available that aim to characterize different classes of structural variation, discovery of balanced rearrangements such as inversions remains an open problem. This is mainly due to the fact that breakpoints of such events typically lie within segmental duplications or common repeats, which reduces the mappability of short reads. The algorithms developed within the 1000 Genomes Project to identify inversions are limited to relatively short inversions, and there are currently no available algorithms to discover large inversions using high throughput sequencing technologies. Results: Here we propose a novel algorithm, VALOR, to discover large inversions using new sequencing methods that provide long range information such as 10X Genomics linked-read sequencing, pooled clone sequencing, or other similar technologies that we commonly refer to as long range sequencing. We demonstrate the utility of VALOR using both pooled clone sequencing and 10X Genomics linked-read sequencing generated from the genome of an individual from the HapMap project (NA12878). We also provide a comprehensive comparison of VALOR against several state-of-the-art structural variation discovery algorithms that use whole genome shotgun sequencing data. Conclusions: In this paper, we show that VALOR is able to accurately discover all previously identified and experimentally validated large inversions in the same genome with a low false discovery rate. Using VALOR, we also predicted a novel inversion, which we validated using fluorescent in situ hybridization. VALOR is available at https://github.com/BilkentCompGen/VALOR